Provide host-level broadcast#129

Merged
maleadt merged 6 commits into main from tb/broadcast on Mar 20, 2026

maleadt (Member) commented Mar 20, 2026

Implementing #121 (comment)

maleadt and others added 2 commits March 20, 2026 10:46
Implements `ct.Tiled(B) .= A .+ A .* B` syntax that leverages Julia's
broadcast fusion machinery and dispatches to cuTile kernels.

- `Tiled` wrapper type with `TiledCuArrayStyle` that wins over
  `CuArrayStyle` and `DefaultArrayStyle`
- `materialize!` converts the fused `Broadcasted` tree: CuArrays become
  TileArrays, style/axes are stripped
- Generic 1D/2D kernels recursively evaluate the `Broadcasted` tree on
  tiles, using `broadcast(bc.f, args...)` for element-wise semantics
- Supports arbitrarily nested fused expressions (e.g. `A .+ A .* B`)

Type-constructor broadcasts (e.g. `BFloat16.(A)`) are not yet supported
due to `Type{T}` fields causing compilation issues.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
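
How a destination wrapper can receive the whole fused expression may be easier to see in a CPU-only sketch. This is not the code from this PR: it uses plain `Array`s, dispatches on the destination type instead of going through `TiledCuArrayStyle`, and assumes every array argument has the same shape (no singleton expansion). The field name `parent` and the helper `evalat` are hypothetical.

```julia
using Base.Broadcast: Broadcasted

# Hypothetical stand-in for the PR's `Tiled` wrapper: it only marks the destination.
struct Tiled{A<:AbstractArray}
    parent::A
end

# Recursively evaluate the fused tree at one index (assumption: same-shaped arrays).
evalat(x::AbstractArray, I) = x[I]
evalat(x::Number, I) = x
evalat(bc::Broadcasted, I) = bc.f(map(arg -> evalat(arg, I), bc.args)...)

# `Tiled(B) .= expr` lowers to `materialize!(Tiled(B), broadcasted(...))`,
# so a method on the destination type receives the whole unevaluated tree.
function Base.Broadcast.materialize!(dest::Tiled, bc::Broadcasted)
    out = dest.parent
    for I in CartesianIndices(out)
        out[I] = evalat(bc, I)
    end
    return dest
end

A = [1.0 2.0; 3.0 4.0]
B = [10.0 20.0; 30.0 40.0]
Tiled(B) .= A .+ A .* B   # single pass over B, no temporary for `A .* B`
```

In the PR, the per-index loop is replaced by a cuTile kernel that walks the same kind of tree on tiles rather than on scalars.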
Replace separate 1D/2D broadcast kernels with a single @generated
kernel that handles arbitrary dimensionality, matching the @fuse
macro's bid-construction pattern for N>3 grid delinearization.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
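
A launch grid has at most three axes, so an N-dimensional tiling with N>3 has to fold its trailing dimensions into one grid axis and unfold them again inside the kernel. The plain-Julia sketch below illustrates one way to do that. The helper name `tile_index`, the 1-based indices, and the column-major folding of dimensions 3..N into the third grid axis are all assumptions, not details taken from this PR or from the `@fuse` macro.

```julia
# Hypothetical helper: map a 3D grid position to an N-dimensional tile index.
# `counts[d]` is the number of tiles along dimension d; indices are 1-based.
function tile_index(bid::NTuple{3,Int}, counts::NTuple{N,Int}) where {N}
    N <= 3 && return ntuple(d -> bid[d], N)
    # Assumption: dimensions 3..N were folded, column-major, into grid axis 3.
    trailing = CartesianIndices(counts[3:end])[bid[3]]
    return (bid[1], bid[2], Tuple(trailing)...)
end

tile_index((2, 3, 1), (2, 3))          # (2, 3)
tile_index((1, 2, 7), (2, 3, 4, 5))    # (1, 2, 3, 2)
```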
@AntonOresten

Collaborator

Is there a more robust way to do tile sizes? I did (64, 64, 1, ...) with 64 in the first dim to get coalesced access for contiguous arrays, and 64 in the second dim mostly to spread it out in case the first dim wasn't large. TileArray already specializes on stride and size divisibility iirc, so maybe sizes and singletons can be taken into account. Just imagine the target array having leading singleton dims!

maleadt and others added 2 commits March 20, 2026 11:33
Replace the hardcoded 64×64 tile sizes with a greedy budget-based
approach (4096 elements) that skips singleton dimensions and caps each
tile dim at the array size. This fixes tiled broadcast for arrays with
leading singleton or small dimensions (e.g. (1, 1024), (4, 1024)).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
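
The greedy budget idea described in this commit message can be sketched in a few lines of plain Julia. The function name `tile_sizes` is hypothetical, and rounding each tile dimension down to a power of two is an added assumption. Only the 4096-element budget, the skipping of singleton dimensions, and the cap at the array size come from the commit message.

```julia
# Hypothetical helper, not the function from the PR.
function tile_sizes(dims::NTuple{N,Int}; budget::Int = 4096) where {N}
    tiles = ones(Int, N)
    remaining = budget
    for d in 1:N
        dims[d] <= 1 && continue   # skip singleton (and empty) dimensions
        remaining < 2 && break     # budget exhausted: leave the rest at 1
        # Cap at the array size; rounding down to a power of two is an assumption.
        tiles[d] = prevpow(2, min(remaining, dims[d]))
        remaining = div(remaining, tiles[d])
    end
    return Tuple(tiles)
end

tile_sizes((1, 1024))      # (1, 1024)
tile_sizes((4, 1024))      # (4, 1024)
tile_sizes((1024, 1024))   # (1024, 4)
```

Because the leading dimensions spend the budget first, a leading singleton costs nothing and the next dimension receives the full 4096 elements.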
maleadt (Member, Author) commented Mar 20, 2026

I pushed something simple, maintaining a constant "element budget per tile", inspired by cuTile Python: https://github.com/NVIDIA/cutile-python/blob/7fb3407556e3d96759107d1bdaa01023101c44f0/samples/VectorAddition.py#L231-L236

maleadt marked this pull request as ready for review March 20, 2026 11:49
Comment thread src/cuTile.jl Outdated
Comment thread src/cuTile.jl Outdated
maleadt merged commit 00e7a54 into main Mar 20, 2026
9 checks passed
maleadt deleted the tb/broadcast branch March 20, 2026 16:27
2 participants